AMENDMENT UNDER 37 C.F.R. § 1.111 
Appln. No.: 09/580,523 



Atty. Docket No.: A7483 



REMARKS 

L Amendments to the Claims 

Claims 1, 3, 10, 13, 16, 19, 22, 25 and 31-62 and 70 are all the claims currently pending 
in the application. Of these claims, claims 31-62 have been withdrawn from consideration; 
claims 1, 10, 13, 16, 19, 22 and 25 have been rejected; and claim 3 has been objected to. No 
claims have been allowed. 

After entry of this Amendment, claims 31-62 and 71-108 will be all the claims pending in 
the application. 

Claims 1, 3, 10, 13, 16, 19, 22, 25 and 70 have been canceled. 
New claims 71-108 have been added. 

New claims 71-80 are directed to polypeptides comprising an amino acid sequence that 
has at least 75% sequence identity with SEQ ED N0:1, wherein the amino acid at the position 
corresponding by sequence alignment to position 118 of SEQ ED N0:1 is alanine or an amino 
acid conservative for alanine, and wherein the polypeptide has at least one in vitro activity 
selected from the group consisting of: cell death promoting activity, Bc1-Xl binding activity, 
and Bcl-2 binding activity. These claims are supported throughout the specification. In 
particular, polypeptides of the present invention having at least 75% sequence identity with SEQ 
ID NO: 1 are described at page 45, lines 13-19. Support for the recitation of alanine or an amino 
acid conservative for alanine at the position corresponding by sequence alignment to position 
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118 of SEQ ID NO: 1 may be found in the specification in Example 8 (pages 86-87) and pages 
41, 45, and 50. Polypeptides having the recited activity are described at, for example, page 9, 
lines 19-22, 

Claims 81-88 are directed to polypeptides comprising an amino acid sequence that has at 
least 75% sequence identity with amino acids 114 to 122 of SEQ ID N0:1, wherein the amino 
acid at the position corresponding by sequence alignment to position 118 of SEQ ID N0:1 is 
alanine or an amino acid conservative for alanine, wherein the polypeptide has at least one in 
vitro activity selected from the group consisting of: cell death promoting activity, Bc1-Xl 
binding activity, and Bcl-2 binding activity, and wherein the polypeptide is at least 10 amino 
acids long. These claims are supported throughout the specification, for example at page 42, 
Hnes 16-19 (describing biologically active fragments of mutant BAD containing a domain 
"substantially identical" to the BH3 domain of SEQ ID NO: 1), page 13, lines 15-16 (defining 
the BH3 domain as amino acids 114-122 of SEQ ID NO: 1), page 45, hnes 13-16 (defining 
"substantially identical" to include at least 75%, preferably 85%, and more preferably 90-95% 
sequence identity), and page 42, line 5 (indicating that a fragment of the present invention must 
be at least 10 amino acids long). Page 42, line 6 indicates that fragments of the present invention 
are preferably at least 25 amino acids long (claim 88). 

Claims 89-98 are directed to polypeptides comprising an amino acid sequence that has at 
least 75% sequence identity with amino acids 103-123 of SEQ ID NO:l, wherein the amino acid 
at the position corresponding by sequence alignment to position 118 of SEQ ID N0:1 is alanine 
or an amino acid conservative for alanine, and wherein the polypeptide has at least one in vitro 
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activity selected from the group consisting of: cell death promoting activity, Bc1-Xl binding 
activity, and Bcl-2 binding activity. These claims are supported throughout the specification, for 
example at page 12, lines 8-12 (describing fragments of mutant BAD comprising amino acids 
103-123 of SEQIDNO: 1). 

Claims 99-108 are directed to polypeptides comprising an amino acid sequence that has 
at least 75% sequence identity with amino acids 106-132 of SEQ ID N0:1, wherein the amino 
acid at the position coiresponding by sequence alignment to position 118 of SEQ ID N0:1 is 
alanine or an amino acid conservative for alanine, and wherein the polypeptide has at least one in 
vitro activity selected from the group consisting of: cell death promoting activity, Bc1-Xl 
binding activity, and Bcl-2 binding activity. These claims are supported throughout the 
specification, in particular at page 43, lines 5-8 in combination with Table 1 at page 41 (defining 
the BH3 domain as amino acid residues 143-168 of SEQ ID NO: 2, which corresponds to 
residues 106-132 of SEQ ID NO: 1). 

XL Objections to the Claims 

At page 2, paragraph 1 of the Office Action, claims 1, 3, 10, 16, 19, 22, 25 and 70 were 
objected to, for the use of the language "BH3 domain" as the sole means of identifying the 
claimed fragment. 

Specifically, the Examiner noted that different laboratories may use the same laboratory 
designations to define completely distinct peptide fragments, and required that the claims be 
amended to unambiguously define "BH3 domain.'* 
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New claims 81-108 recite the specific amino acid residues of SEQ ID NO: 1 that define 
the BH3 domain, rather than "the BH3 domain" of mutant BAD (see above). Accordingly, 
Applicants respectfully request reconsideration and withdrawal of this objection. 

11. Rejection of Claims Under 35 U.S.C. § 102(e) 

At page 2 of the Office Action, claims 1, 10, 16, 19, 22, 25 and 70 were rejected under 35 
U.S.C. § 102(e) as being anticipated by Home et al., U.S. Patent No. 5,965,703. 

The Examiner stated that the cited patent teaches the sequence of human BAD (citing SEQ 
ID NO: 2, claim 1, and column 3, lines 25-32 of the reference), which is 100% identical to the full 
length of the claimed SEQ ID NO: 1, from amino acids 1 to 168. The Examiner contended that in 
view of a lack of a definition for "corresponding," any amino acid including alanine or an amino acid 
conservative for alanine, at any position of SEQ ID NO: 2, could "correspond" to position 1 18 of 
SEQ ID NO: 1. Similarly, the Examiner contended that any amino acid of SEQ ID NO: 2 could 
"correspond" to positions 103-123 of SEQ ID NO: 1. Thus, according to the Examiner, the claimed 
mutant BAD and fragment thereof appear to be the same as the prior art polypeptide and fragment 
thereof. 

New claims 71 to 108 recite polypeptides wherein the amino acid at the position 
corresponding by sequence alignment to position 118 of SEQ ID N0:1 is alanine or an amino 
acid conservative for alanine. In contrast, SEQ ID NO: 2 of the '703 patent includes a serine 
residue at position 118. Accordingly, SEQ ID NO: 2 does not anticipate the presentiy claimed 
polypeptides. 
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Furthermore, Applicants respectfully submit that "the amino acid position corresponding by 
sequence alignment to position 1 18 of SEQ ID NO: 1," is clearly defined in the specification with 
regard to any of the homologues and fragments encompassed by the present claims. 

For example, pages 9, 41 and 45 of the specification describe how sequence alignment allows 
identification of regions of sequence homology (see also page 25 and 26 of the November 19 
amendment). In particular, the specification notes that sequence alignment allows "identification of 
the serine, or other amino acid, at a position corresponding to the serine at position 1 18 of SEQ ID 
NO: 1" (page 45, lines 8-11). Similarly, page 50 of the specification indicates that the position 
corresponding to Ser 1 18 of SEQ ID NO: 1 can be identified by the alignment of the sequence of a 
mutant BAD or fragment of a mutant BAD with SEQ ID NO: 1. In addition, Table 1 at page 41 
illustrates how sequence alignment can be used to identify the position corresponding to 118 of 
SEQ ID NO: 1 in the mouse BAD sequences, which vary slightly from SEQ ID NO: 1. 

Finally, numerous widely available references, such as Tatusova and Madden (1999), 
"Blast 2 sequences - a new tool for comparing protein and nucleotide sequences," FEMS 
Microbiol. Lett. 174:247-250 and Altschul et al. (1997), "Gapped BLAST and PSI-BLAST: a 
new generation of protein database search programs," Nucleic Acids Res. 25:3389-3402 
(submitted herewith), describe the use of sequence alignment to compare related polypeptide 
sequences. 

Thus, the amino acid residue of SEQ ID NO: 2 of Home et al. that corresponds by sequence 
alignment to position 118 of SEQ ID NO: 1 is clearly a serine. Accordingly, the present claims are 
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not anticipated by this reference, and Applicants respectfully request reconsideration and withdrawal 
of this rejection. 

III. Conclusion 

In view of the above, reconsideration and allowance of this application are now believed 
to be in order, and such actions are hereby solicited. If any points remain in issue which the 
Examiner feels may be best resolved through a personal or telephone interview, the Examiner is 
kindly requested to contact the undersigned at the telephone number listed below. 



The USPTO is directed and authorized to charge all required fees, except for the Issue 
Fee and the Publication Fee, to Deposit Account No. 19-4880. Please also credit any 
overpayments to said Deposit Account. 

Respectfully submitted. 




SUGHRUE MION, PLLC Lisa E. Stahl 

Telephone: (202) 293-7060 Registration No. 56,704 

Facsimile: (202) 293-7860 

WASHINGTON OFFICE 

23373 

CUSTOMER NUMBER 

Date: July?, 2005 



29 




mm 



MICROBIOLOGY 

T^T CT7AfTt70 LETTERS 
hLSEVIER FEMS Microbiology Letters 174 (1999) 247-250 === 



BLAST 2 Sequences, a new tool for comparing protein and 

nucleotide sequences 

Tatiana A. Tatusova *, Thomas L. Madden 

National Center for Biotechnology Information ( NCBI), National Library of Medicine, National Institutes of Health. Bethesda, 

MD 20894, USA 

Received 18 February 1999; received in revised form 17 March, 1999; accepted 18 March 1999 



Abstract 

*BLAST 2 Sequences*, a new BLAST-based tool for aligning two protein or nucleotide sequences, is described. While the 
standard BLAST program is widely used to search for homologous sequences in nucleotide and protein databases, one often 
needs to compare only two sequences that are already known to be homologous, coming from related species or, e.g. different 
isolates of the same virus. In such cases searching the entire database would be unnecessarily time-consuming. *BLAST 2 
Sequences* utilizes the BLAST algorithm for pairwise DNA-DNA or protein-protein sequence comparison. A World Wide 
Web version of the program can be used interactively at the NCBI WWW site (http://www.ncbi.nlm.nih.gov/gorf/bI2.html). 
The resulting alignments are presented in both graphical and text form. The variants of the program for PC (Windows), Mac 
and several UNIX-based platforms can be downloaded from the NCBI FTP site (ftp://ncbi.nlm.nih.gov). © 1999 Published 
by Elsevier Science B.V, All rights reserved. 

Keywords: Algorithm; Sequence alignment; Software 



1. Introduction 

BLAST is a rapid sequence comparison tool that 
uses a heuristic approach to construct alignments by 
optimizing a measure of local similarity [1,2]. Since 
BLAST compares protein and nucleotide sequences 
much faster than dynamic programming methods 
such as Smith-Waterman and Needleman-Wunsch 
[3,4], it is widely used for database searches. A num- 
ber of important scientific contexts, however, involve 
the comparison of only two sequences and do not 
require a time-consuming database search. This hap- 
pens for example, when one compares a series of 
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viral isolates that may differ in only several nucleo- 
tide residues per genome. Ongoing projects of se- 
quencing closely related microbial genomes, such as 
the genomes of Helicobacter pylori strains 26695 and 
J99 [5,6], make this a very common task. To meet 
these needs we have developed a program that uses 
the BLAST algorithm to produce reliable sequence 
alignments, without computer-intensive and time- 
consuming database searches. The BLAST 2 Sequen- 
ces program finds multiple local alignments between 
two sequences, allowing the user to detect homolo- 
gous protein domains or internal sequence duplica- 
tions. BLAST 2 Sequences has been very useful for 
the comparison of homologous genes from complete 
microbial genomes. Using BLAST 2 Sequences for 
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nucleotide sequence connparison of different strains 
or isolates of the same virus offers a convenient strat- 
egy to study the genome variations and evolutionary 
events, such as substitutions, insertions and dele- 
tions. 



2. Methods 

2 J. Algorithm 

'BLAST 2 Sequences' is an interactive tool that 
utilizes the BLAST engine for pairwise DNA-DNA 
or protein-protein sequence comparison and is based 
on the same algorithm and statistics of local align- 
ments that have been described earlier [1,2], The 
BLAST 2.0 algorithm generates a gapped alignment 
by using dynamic programming to extend the central 
pair of aligned residues. The heuristic methods con- 
fine the alignments to a predefined region of the path 
graph. A performance evaluation of the new gapped 
BLAST algorithm and its comparison to that of the 
original ungapped BLAST [1] and the Smith-Water- 
man algorithm [3] have been presented [4]. 

2.2 User-defined parameters 

The 'BLAST 2 Sequences' interface r:i]!ows the 
user to perform a series of searches with various 
parameters. The program can align hundreds of se- 
quences within a reasonable time. Different scoring 
matrices are provided for protein-protein compari- 
sons; each matrix is most sensitive at finding simi- 
larities at a specific evolutionary distance. The de- 
fault matrix, BLOSUM62 [7] is generally considered 
to be the best for a wide variety of distances. 

Changing the gap existence and extension penal- 
ties may change the number and length of gaps in an 
alignment. There is no analytical formula that deter- 
mines the 'best' gap values to use, so that one may 
wish to experiment with values in order to explore 
more of the alignment *space'. BLAST 2.0 uses affine 
gap costs, which assess a score (a+bk) for a gap of 
length *k' [8]; long gaps do not cost much more than 
short ones. Note that only limited values for the gap 
existence and extension penalties are supported, so 
that Karlin-Altschul statistics [9] can be applied. The 
default values of parameters are set up by Javascript 
function. Selection of the scoring matrix changes the 



default values of the gap penalties. An incorrect pa- 
rameter setting returns an error response and brings 
back the main page, allowing the user to change the 
selection. - 

Both protein and nucleotide sequences may con- 
tain regions of low compositional complexity, which 
give spuriously high BLAST scores that reflect com- 
positional bias rather than significant position by 
position alignment [10]. The SEG program [10] can 
be used for proteins and the DUST program (Tatu- 
sov and Lipman, in preparation) for nucleotides if 
the Tilter' parameter is set. One may also wish to 
view alignments without a complexity filter, espe- 
cially if it seems possible that an important part of 
the aligned sequence has been filtered over. In that 
case the dot-plot display (see Fig. 1) notes the loca- 
tions that would have been masked. 

BLAST initiates extensions between sequences us- 
ing a word, meaning that alignments need to share 
similarity along at least a 'word size' number of let- 
ters. The default value is 1 1 for nucleotide-nucleotide 
ahgnments and an exact match of 'word size' nucleo- 
tides between the two sequences is required; three is 
the default value for protein-protein matches and the 
sequences may merely be similar along the words, 
according to the matrix selected. If better sensitivity 
is needed, one should use a smaller value for the 
'word size', but it is restricted to the range 7-20 
for nucleotide comparisons and 2-3 for proteins. 

The gapped BLAST 2.0 alignment algorithm uses 
dynamic programming to extend a central pair of 
aligned residues in both directions. This approach 
considers only the alignments that drop in score no 
more than 'x_dropofr below the best score yet 
found. Increasing the *x_dropofr value increases 
the ability to produce a single gapped alignment 
rather than a collection of ungapped ones. Usually 
the value of 50 is enough to produce a single gapped 
alignment for an expected significance of 10, though 
these parameters may vary with the scoring system 
and gap costs. The expectation value is the expected 
frequency of the matches to be found merely by 
chance, according to the stochastic model of Karlin 
and Altschul [11]. To evaluate the statistical signifi- 
cance we choose to use the expectation value over 
the entire database search rather than the pairwise 
expectation. That makes it easier to compare the 
results of pairwise alignment with the results of the 
data base searches. 
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Fig. 1. The alignment of two protein sequences from Aquifex aeolicus. Blue rectangles: local alignment segments; red lines: gaps; light 
gray lines on dot plot picture represent the regions of low complexity on query sequence. The amino acids of query sequence shown in 
blue color will be filtered out if 'filter' option is applied. 
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2.3. Data constraints 

The program is not generally useful for motif-style 
searching and aligning megabase size genomic se- 
quences is not recommended. The maximum number 
of characters per sequence, that may be accommo- 
dated is '^150 kb, the optimal size of query sequence 
is about 1 kb. 



3. Results and discussion 

The result page (Fig. 1) starts with the values of 
parameters that were selected to produce the results. 
The user can recalculate the alignments by changing 
the parameters from this page and clicking on the 
'Align' button, which provides a fast and convenient 
way of comparing the results for different values of 
parameters. It might be useful to compare the results 
of protein-protein alignments for different scoring 
matrices or change the expectation value. 

The graphical representation shows schematically 
the set of gapped local alignments found between the 
two sequences with gaps shown in red. Clicking on 
the graphics brings the user to the detailed represen- 
tation of that alignment described below. The dot 
plot picture provides the user with an overview of 
sequence similarity. 

The detailed view for each segment of alignment 
includes the statistics with the percentage of identi- 
ties, positives and gaps, schematic view, and the text 
alignment view. Note that the statistics are calcu- 
lated based on the size of the NCBI non-redundant 
database. The bit score reported is the raw score 
converted to bits of information by multiplying by 
the Lambda parameter [II], with the raw score 
shown in brackets. The expectation value reports 
shows the number of alignments with that score 
one expects to find by chance. If the sequences are 
taken from the Gen Bank database and defined by gi 
or an accession number the hot link to the Entrez 
query system will be provided. 

The last part of the report shows the parameters 
of the calculations and the summary of BLAST sta- 
tistics. The report is the same as the regular BLAST 
report, providing an easy way to compare the results 
of the alignment of two sequences with the results of 
an entire database search. 
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ABSTRACT 

The BLAST programs are widely used tools for 
searching protein and DNA databases for sequence 
similarities. For protein comparisons, a variety of 
definitional, algorithmic and statistical refinements 
described here permits the execution time of the 
BLAST programs to be decreased substantially while 
enhancing their sensitivity to weak similarities. A new 
criterion for triggering the extension of word hits, 
combined with a new heuristic for generating gapped 
alignments, yields a gapped BLAST program that runs 
at approximately three times the speed of the original. 
In addition, a method is introduced for automatically 
combining statistically significant alignments pro- 
duced by BLAST into a position-specific 3core matrix, 
and searching the database using this matrix. The 
resulting Position-Specific Iterated BLAST (PSI- 
BLAST) program runs at approximately the same 
speed per iteration as gapped BLAST, but in many 
cases is much more sensitive to weak but biologically 
relevant sequence similarities. PSI-BLAST is used to 
uncover several new and interesting members of the 
BRCT superfamily. 

INTRODUCTION 

Variations of the BLAST algorithm (1) have been incorporated 
into several popular programs for searching protein and DNA 
databases for sequence similarities. BLAST programs have been 
written to compare protein or DNA queries with protein or DNA 
databases in any combination, with DNA sequences often 
undergoing conceptual translation before any comparison is 
performed. We will use the blastp program, which compares 
protein queries to protein databases, as a prototype for BLAST, 
although the ideas preisented extend immediately to other 
versions that involve the translation of a DNA query or database. 
Some of the refinements described are applicable as well to 
DNA-DNA comparison, but have yet to be implemented. 



BLAST is a heuristic that attempts to optimize a specific 
similarity measure. It pennits a tradeoff between speed and 
sensitivity, with the setting of a ^threshold' parameter, T. A higher 
value of r yields greater speed, but also an increased probability 
of missing weak similarities. The BLAST program requires time 
proportional to the product of the lengths of the query sequence 
and the database searched. Since the rate of change in database 
sizes currently exceeds that of processor speeds, computers 
running BLAST are subjected to increasing load. However, the 
conjunction of several new algorithmic ideas allow a new version 
of BLAST to achieve improved sensitivity at substantially 
augmented speed. This paper describes three major refinements 
to BLAST. 

(i) For increased speed, the criterion for extending word pairs 
has been modified. The original BLAST program seeks short 
word pairs whose aligned score is at least T, Each such *hit* is then 
extended, to test whether it is contained within a high-scoring 
alignment. For the default T value, this extension step consumes 
most of the processing time. The new * two-hit' method requires 
the existence of two non-overlapping word pairs on the same 
diagonal, and within a distance A of one another, before an 
extension is invoked. To achieve comparable sensitivity, the 
threshold parameter T must be lowered, yielding more hits than 
previously. However, because only a small fraction of these hits 
are extended, the average amount of computation required 
decreases. 

(ii) The ability to generate gapped alignments has been added. 
The original BLAST program often finds several alignments 
involving a single database sequence which, when considered 
together, are statistically significant. Overlooking any one of 
these alignments can compromise the combined result. By 
introducing an algorithm for generating gapped alignments, it 
becomes necessary to find only one rather than all the ungapped 
alignments subsumed in a significant result. This allows the T 
parameter to be raised, increasing the speed of the initial database 
scan. The new gapped alignment algorithm uses dynamic 
programming to extend a central pair of aligned residues in both 
directions. For speed, earlier heuristic methods (2,3) confined the 
alignments produced to a predefined strip of the dynamic 
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programming path graph (4). Our approach considers only 
alignments that drop in score no more than Xg below the best 
score yet seen. The algorithm is able thereby to adapt the region 
of the path graph it explores to the data. 

(iii) BLAST searches may be iterated, with a position-specific 
score matrix generated from significant alignments found in 
round / used for round t + 1. Motif or profile search methods 
frequently are much more sensitive than pairwise comparison 
methods at detecting distant relationships. However, creating a 
set of motifs or a profile that describes a protein family, and 
searching a database with them, typically has involved running 
several different programs, with substantial user intervention at 
various stages. The BLAST algorithm is easily generalized to use 
an arbitrary position-specific score matrix in place of a query 
sequence and associated substitution matrix. Accordingly, we 
have automated the procedure of generating such a matrix from 
the output produced by a BLAST search, and adapted the BLAST 
algorithm to take this matrix as input. The resulting Position- 
Specific Iterated BLAST, or PSI-BLAST, program may not be as 
sensitive as the best available motif search programs, but its speed 
and ease of operation can bring the power of these methods into 
more common use. 

After describing these refinements to BLAST in greater detail, 
we consider several biological examples for which the sensitivity 
and speed of the program are greatly enhanced. 

STATISTICAL PRELIMINARIES 

To analyze the BLAST algorithm and its refinements, we need 
first to review the statistics of high- scoring local alignments. 
BLAST employs a substitution matrix, which specifies a score s^ 
for aligning each pair of amino acids / and / Given two sequences 
to compare, the original BLAST program seeks equal-length 
segments within each that, when aligned to one another without 
gaps, have maximal aggregate score. Not only the single best 
segment pair may be found, but also other locally optimal pairs 
(3,5-7), whose scores cannot be improved by extension or 
trinmiing. Such locally optimal alignments are called 'high-scor- 
ing segment pairs* or HSPs. 

For the sake of the statistical theory, we assume a simple protein 
model in which the amino acids occur randomly at all positions 
with background probabilities Pj. We require that the expected 

score for two random amino acids ^ PiPiS-,^ be negative. Given 

iJ 

the P[ and sy, the basic theory (8,9) yields two calculable 
parameters, X and K, which can be used to convert nominal HSP 
scores to normalized scores, thereby rendering all scoring 
systems directly comparable from a statistical perspective. The 
normalized score for an HSP is given by the equation: 

^, ^ XS -\nK J 
In 2 

In this paper, a nominal score is given without units, while a score 
normalized by equation 1 is said to be expressed in bits (10,1 1). 
When two random protein sequences of sufficient lengths m and 
n are compared, the number E of distinct HSPs with normalized 
score at least S' expected to occur by chance is well approximated 
by: 

E = N/2'' 2 



where N = mn \s the search space size (8-10). If a protein is 
compared to a whole database rather than a single sequence, n is 
the database length in residues. Equation 2 may be inverted to 
yield = log2(M£r), the normalized score required to achieve a 
particular £- value. In a typical current database search, a protein 
of length 250 might be compared to a protein database of 50 QOO 000 
total residues. To achieve a nnarginally significant £-value of 
0.05, a normalized score of -38 bits is necessary. 

While the theory just outlined has not been proved for gapped 
local alignments and their associated scores, computational 
experiments strongly suggest that it reniains valid (3,12-15). The 
statistical parameters X and K, however, are no longer supplied by 
theory but must be estimated using comparisons of either 
simulated or real but unrelated sequences. To distinguish below 
whether a given set of parameters X and K refer to gapped or 
ungapped alignments, we use the subscripts g and u respectively. 

When gaps are not allowed, a further important theorem states 
that within HSPs the aligned pair of letters (iJ) tends to occur with 
the 'target frequency*: 

q- = P.P.e^^^ii 3 
i ij 

The ^ij of equation 3 sum to 1 ; indeed, X^^ is calculated to be the 
unique positive number for which this is the case (8,9). The scores 
5ij are optimal for detecting alignments with these particular target 
frequencies (8,10), and by inverting equation 3 to = 
[ln(^ij/PiPj)]/iu. scores may be chosen, with arbitrary scale, that 
correspond to any desired set of qiy The popular PAM (16,17) and 
BLOSUM (18) substitution matrices are constructed with explicit 
use of this log-odds formula. No corresponding result has been 
established for gapped alignment scoring systems. However, if 
the gap costs used are sufficiently large, it is expected that the 
target frequencies observed in high-scoring local alignments of 
random sequences will not differ gready from those for the 
no-gap case. 

REFINEMENT OF THE BASIC ALGORITHM: THE 
TWO-HIT METHOD 

The central idea of the BLAST algorithm is that a statistically 
significant alignment is likely to contain a high-scoring pair of 
aligned words. BLAST fu^t scans the database for words 
(typically of length three for proteins) that score at least T when 
aligned with some word within the query sequence. Any aligned 
word pair satisfying this condition is called a hit. The second step 
of the algorithm checks whether each hit lies within an alignment 
with score sufficient to be reported. This is done by extending a 
hit in both directions, until the running alignment's score has 
dropped more than X below the maximum score yet attained. This 
extension step is computationally quite costly; with the T and X 
parameters necessary to attain reasonable sensitivity to weak 
alignments, the extension step typically accounts for >90% of 
BLAST'S execution time. It is therefore desirable to reduce the 
number of extensions performed. 

Our refined algorithm is based upon the observation that an 
HSP of interest is much longer than a single word pair, and may 
therefore entail multiple hits on the same diagonal and within a 
relatively short distance of one another (The diagonal of a hit 
involving words starting at positions {x\,X2) of the database and 
query sequences may be defined as x\ -X2- The distance between 
two hits on the same diagonal is the difference between their first 
coordinates.) This signature may be used to locate HSPs more 
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Figure 1. The sensitivity of the two-hit and one-hit heuristics as a function of 
HSP score. Using the BLOSUM-62 amino acid substitution matrix ( 1 8)» and the 
taiget frequencies ^ij impHed by equation 3 and the background amino acid 
frequencies Pi of Robinson and Robinson (20), 100 000 model HSPs were 
generated for each of the nominal scores 37-92, corresponding to normalized 
scores 19.9-45.1 bits. It was determined by inspection whether each HSP failed 
to contain two non-overlapping length-3 word pairs with nominal score at least 
1 1, and within a distance 40 of one another, or a single length-3 word pair with 
nominal score at least 13. The conresponding probabilities of missing an HSP 
using the two-hit heuristic with T= \ \, and the one-hit heuristic with 7= 13, 
are plotted as a function of normalized HSP score. The two-hit method is more 
sensitive for HSPs with score at least 33 bits. 



efficiently. Specifically, we choose a window length A, and 
invoke an extension only when two non-overlapping hits are 
found within distance A of one another on the same diagonal. Any 
hit that overlaps the most recent one is ignored. Efficient 
execution requires an array to record, for each diagonal, the first 
coordinate of the most recent hit found. Since database sequences 
are scanned sequentially, this coordinate always increases for 
successive hits. The idea of seeking multiple hits on the same 
diagonal was first used in the context of biological database 
searches by Wilbur and Lipman (19). 

Because we require two hits rather than one to invoke an 
extension, the threshold parameter T must be lowered to retain 
comparable sensitivity. The effect is that many more single hits 
are found, but only a small fraction have an associated second hit 
on the same diagonal that triggers an extension. The great 
majority of hits may be dismissed after the minor calculation of 
looking up, for the appropriate diagonal, the coordinate of the 
most recent hit, checking whether it is within distance A of the 
current hit*s coordinate, and finally replacing the old with the new 
coordinate. Empirically, the computation saved by requiring 
fewer extensions more than offsets the extra computation 
required to process the larger number of hits. 

To study the relative abilities of the one-hit and two-hit methods 
to detect HSPs of varying score, we model proteins using the 
background amino acid frequencies of Robinson and Robinson 
(20), and use the BLOSUM-62 substitution matrix (18) for 
sequence comparison. Given these Pj and sy, the statistical 
parameters for ungapped local alignments are calculated to be 
X^i ~ 0.3176 and = 0.134. Using equation 3 above, we may 
calculate the <?ij for which the scoring system is optimized, and 
employ these target frequencies to generate model HSPs. Finally, 
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Figure 2. The BLAST comparison of broad bean leghemoglobin I (87) 
(S WISS-PROT accession no. P02232) and horse P-globin (88) (SWISS-PROT 
accession no. P02062). The 15 hits with score at least 13 are indicated by plus 
signs. An additional 22 non-overlapping hits with score at least 1 1 arc indicated 
by dots. Of these 37 hits, only the two indicated pairs are on the same diagonal 
and within distance 40 of one another. Thus the two-hit heuristic with 7=11 
triggers two extensions, in place of the 15 extensions invoked by the one-hit 
heuristic with T= 13. Because this is just one example, the relative numbers of 
hits and extensions at the various settings of 7* correspond only roughly to the 
ratios found in a full database search. An ungapped extension of the leftward 
of the two hit pairs yields an HSP with nominal score 45, or 23.6 bits, calculated 
using Xu and K^. 



we evaluate the sensitivity of the one-hit and two-hit BLAST 
heuristics using these HSPs. 

The one-hit method will detect an HSP if it somewhere contains 
a length- W word of score at least T. For W = 3 and 7=13, Figure 
1 shows the empuically estimated probability that an HSP is 
missed by this method, as a function of its normalized score. The 
two-hit method will detect an HSP if it contains two non- 
overiapping length-lV words of score at least 7, with starting 
positions that differ by no more than A residues. For W=3,T= 11 
and A = 40, Figure 1 shows the estimated probability that an HSP 
is missed by this method, as a function of its normalized score. For 
HSPs with score at least 33 bits, the two-hit heuristic is more 
sensitive. 

To analyze the relative speeds of the one-hit and two-hit 
methods, using the parameters studied above, we note that the 
two-hit method generates on average -3.2 times as many hits, but 
only -0. 14 times as many hit extensions (Fig. 2). Because it takes 
approximately one ninth as long to decide whether a hit need be 
extended as actually to extend it, the hit-processing component of 
the two-hit method is approximately twice as rapid as the same 
component of the one-hit method. 

TRIGGERING THE GENERATION OF GAPPED 
ALIGNMENTS 

Figure 1 shows that even when using the original one-hit method 
with threshold parameter 7=13, there is generally no greater than 
a 4% chance of missing an HSP with score >38 bits. While this 
would appear sufficient for most purposes, the one-hit default T 
parameter has typically been set as low as 11, yielding an 
execution time nearly three times that for 7= 13. Why pay this 
price for what appears at best marginal gains in sensitivity? The 
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Leghenoglobln 4 3 PSFLKDSAGWDSPKLGAHAEKVFGHVRDSAVQLRATGEW- -LDGKDGS 

F L + V+ +PK+ AH +KV L + GE V LD G+ 

Beta globin 4 5 FGDLSNPGAVMGNPKVKAHGKKV LHSFGEGVHHLDMLKGTFAALSE 



Legherooglobin 91 IHIQKGVLDP-HFWVKEALLKTIKEASGDKWSEELSAAWEVAYDGIATAI 140 

+H K +DP +F ++ L+ + G ++ EL A+++ G+A A+ 
Beta globin 91 LHCOKLHVDPENFRLLGNVLVWLARHFGKDFTPELQASYQKWAGVAHAL 141 



Figure 3. A gapped extension generated by BLAST for the comparison of broad bean leghemoglobin I (87) and horse P-globin (88). (a) The region of the path graph 
explored when seeded by the alignment of alanine residues at respective positions 60 and 62. This seed derives from the HSP generated by the leftward of the two 
ungapped extensions illustrated in Figure 2. The Xg dropoff parameter is the nominal score 40, used in conjunction with BLOSUM-62 substitution scores and a cost 
of 10 + i for gaps of length k. (b) The path corresponding to the optimal local alignment generated, superimposed on the hits described in Figure 2. The original BLAST 
program, using the one-hit heuristic with 7*= 1 1 . is able to locate three of the five HSPs included in this alignment, but only the and last achieve a score sufficient 
to be reported, (c) The optimal local alignment, with nominal score 75 and normalized score 32.4 bits. In the context of a search of SWISS-PROT (26), release 34 
(21 219 450 residues), using the leghemoglobin sequence (143 residues) as query, the £-value is 0.54 if no edge-effect correction (22) is invoked. The original BLAST 
program locates the first and last ungapped segments of this alignment. Using sum-statistics with no edge-effect conection, this combined result has an f- value of 
3 1 (21,22). On the central lines of tlie alignment, identities are echoed and substitutions to which the BLOSUM-62 matrix (18) gives a positive score are indicated 
by a '+' symbol. 



reason is that the original BLAST program treats gapped 
alignments implicitly by locating, in many cases, several distinct 
HSPs involving the same database sequence, and calculating a 
statistical assessment of the combined result (21,22). This means 
that two or more HSPs with scores well below 38 bits can. in 
combination, rise to statistical significance. If any one of these 
HSPs is missed, so may be the combined result. 

The approach taken here allows BLAST to simultaneously 
produce gapped alignments and mn significantly faster than 
previously. The central idea is to trigger a gapped extension for 
any HSP that exceeds a moderate score chosen so that no more 
than about one extension is invoked per 50 database sequences. 
(By equation 2, for a typical-length protein query, 5g should be set 
at -22 bits.) A gapped extension takes much longer to execute 
than an ungapped extension, but by performing ve;y few of them 
the fraction of the total running time they consume can be kept 
relatively low. 

By seeking a single gapped alignment, rather than a collection 
of ungapped ones, only one of the constituent HSPs need be 
located for the combined result to be generated successfully. This 
means that we may tolerate a much higher chance of missing any 
single moderately scoring HSP. For example, consider a result 
involving two HSPs, each with the same probability P of being 
missed at the hit-stage of the BLAST algorithm, and suppose that 
we desire to find the combined result with probability at least 



0.95. The original algorithm, needing to find both HSPs, requires 
2P - p2 < Q or P less than -0.025. In contrast, the new 
algorithm requires only that < 0.05, and thus can tolerate P as 
high as 0.22. This permits the T parameter for the hit-stage of the 
algorithm to be raised substantially while retaining comparable 
sensitivity — from 7= 11 to T- 13 for the one-hit heuristic. (The 
two-hit heuristic described above lowers Tback to 1 1.) As will be 
discussed below, the resulting increase in speed more than 
compensates for the extra time required for the rare gapped 
extension. 

In summary, the new gapped BLAST algorithm requires two 
non-overlapping hits of score at least 7, within a distance A of one 
another, to invoke an ungapped extension of the second hit. If the 
HSP generated has normalized score at least 5g bits, then a gapped 
extension is triggered. The resulting gapped alignment is reported 
only if it has an £- value low enough to be of interest. For example, 
in the pairwise comparison of Figure 2, the ungapped extension 
invoked by the hit pair on the left produces an HSP with score 
23.6 bits (calculated using Xu and Ky^. This is sufficient to trigger 
a gapped extension, which generates an alignment with score 32.4 
bits (calculated using and and £- value of 0.5 (Fig. 3). The 
original BLAST program locates only the first and last ungapped 
segments of this alignment (Fig. 3c), and assigns them a 
combined £-value >50 times greater. 
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THE CONSTRUCTION AND STATISTICAL 
EVALUATION OF GAPPED LOCAL ALIGNMENTS 

The standard dynamic programming algorithms for pairwise 
sequence alignment perform a fixed amount of computation per 
cell of a path graph, whose dimensions are the lengths of the two 
sequences being compared (23-25). In order to gain speed, 
database search algorithms such as Fasta (2) and an earlier gapped 
version of BLAST (3) sacrifice rigor by confining the dynamic 
programming to a banded section of the fill! path graph (4), 
chosen to include a region of already identified similarity. One 
problem with this approach is that the optimal gapped alignment 
may stray beyond the confines of the band explored. As the width 
of the band is increased to reduce this possibility, the speed 
advantage of the algorithm is vitiated. 

We have accordingly taken a different heuristic approach to 
constmcting gapped local alignments, which is a simple general- 
ization of BLAST'S method for constructing HSPs. The central 
idea is to consider only cells for which the optimal local alignment 
score falls no more than below the best alignment score yet 
found. Starting from a single aligned pair of residues, called the 
seedy the dynamic programming proceeds both forward and 
backward through the path graph (Zheng Zhang et al, manuscript 
in preparation) (Figs 3a and 4). The advantage of this approach 
is that the region of the path graph explored adapts to the 
alignment being constmcted. The alignment can wander arbitrari- 
ly many diagonals away from the seed, but the number of cells 
expanded on each row tends to remain limited, and may even 
shrink to zero before a boundary of the padi graph is encountered 
(Fig. 4). The Xg parameter serves a similar function to the 
band-width parameter of the earlier heuristic, but the region of the 
path graph it implicitly specifies be explored is in general more 
productively chosen. 

An important element for this heuristic is the intelligent choice 
of a seed. Given an HSP whose score is sufficiently high that it 
triggers a gapped extension, how does one choose a residue pair 
to force into alignment? While more sophisticated approaches are 
possible, the simple procedure we have implemented is to locate, 
along the HSP, the length- 11 segment with highest alignment 
score, and use its central residue pair as the seed. If the HSP itself 
is shorter than 11, a central residue pair is chosen. For example, 
the first ungapped region in the alignment of Figure 3c constitutes 
the HSP that triggered the alignment. The highest-scoring 
length-il segment of this HSP aligns leghemoglobin residues 
55-65 with P-globin residues 57-67, Thus the alanine residues at 
respective positions 60 and 62 are used as the seed for the gapped 
extension illusU"ated in Figure 3a. As discussed in the perform- 
ance evaluation section below, this procedure is extremely good 
at selecting seeds that in fact participate in an optimal alignment. 

Most gapped extensions are triggered by chance similarities, 
and are therefore likely to be of limited extent, as illustrated in 
Figiu-e 4. The reverse extension in this example explores -2000 
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Figure 4. The path graph region explored by BLAST during a gapped extension 
for the comparison of broad bean leghemoglobin I and the ElB protein small 
T-antigen from human adenovirus type 4 (89) (SWISS-PROT accession no. 
P 10406). The dropoff parameter is the nominal score 40, used in conjunction 
with BLOSUM-62 substitution scores and 10 + /: gap costs. The 22.7 bit HSP 
that triggers this extension, involving leghemoglobin residues 119-140 and 
adenovirus residues 101-122, is merely a random similarity, and not part of a 
larger and higher-scoring alignnienL The gapped extension is seeded by the 
alignment of residues 124 and 106. The optimal alignment score through points 
in the path graph drops steadily as one moves beyond the triggering HSP, and 
the reverse extension terminates before the beginning of either protein is 
reached. A total of 2766 path graph cells are explored, with the reverse 
extension accounting for 2047 of these cells. 



path graph cells, so that a typical two-way gapped extension that 
does not encounter the end of either sequence is expected to 
involve -4000 cells. Because Sg is set so that a gapped extension 
is invoked less than once per 50 database sequences, fewer than 
80 cells need be explored per database sequence. 

The execution time required for a gapped extension is -500 
times that for an ungapped extension. However, by triggering 
gapped extensions in the manner described, while simultaneously 
raising T for the single-hit version of BLAST from 11 to 13, 
approximately one gapped extension is invoked for every 4000 
ungapped extensions avoided. Because the number of ungapped 
extensions is reduced by about two thirds, the total time spent on 
the extension stage of BLAST is cut by well over half. Of course, 
the two-hit strategy described above reduces the time needed for 
the ungapped extensions still further Once program overhead is 
accounted for, the net speedup is a factor of about three. 

For any alignment actually reported, a gapped extension that 
records *traceback' information (25) needs to be executed. To 
increase BLAST'S accuracy in producing optimal local align- 
ments, these gapped extensions use by default a substantially 
larger Xg parameter than employed during the program^s search 
stage. 



Table 1. Relative times spent by the original and gapped BLAST programs on various algorithmic stages 
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The times required by various steps of the BLAST algorithm 
vary substantially from one query and one database to another 
Table 1 shows typical relative times spent by the original and the 
gapped BLAST programs on various algorithmic stages. The 
* original BLAST program is represented, here and below, by a 
variant form of blastp version 1.4.9, modified so that it uses the 
same edge-effect correction (22) and background amino acid 
frequencies as the 'gapped BLAST*. The times represent the 
average for three different queries, with the time for the original 
BLAST program normalized in each instance to 100 units. 

More concretely, to search SWISS-PROT (26), release 34 
(59 576 sequences; 21 219 450 residues), with the length-567 
influenza A virus hemagglutinin precursor (27) as query, the 
original BLAST program requires 45.8 s, and the gapped BLAST 
program 15.8 s. This timing experiment, and others referred to 
below, was run on one 200 MHz R 10000 cpu processor of a 
lightly loaded SGI Power Challenge XL computer with 2.5 
Gbytes of RAM. This machine runs the operating system IRIX, 
version 6.2, which is an implementation of UNIX. We used the 
standard SGI C compiler, with the -O flag for optimization, to 
compile all versions of the programs. The times reported are the 
user times given by the time conMiand, and are for the better of 
two identical runs. 

A closely related type of gapped extension routine to that used 
here was developed by G. Myers during the evaluation of the 
original BLAST algorithm. It was not included in the publicly 
distributed code primarily because the then current strategy of 
extending every hit decreased the algorithm's speed unduly for the 
relatively small gain in sensitivity realized (1). 

As discussed above, the statistical significance of gapped 
alignments may be evaluated using the two statistical parameters 
Xg and K^. The current version of the Fasta program (2) estimates 
these parameters on each mn, by analyzing the distribution of 
alignment scores produced by all the sequences in the database. 
BLAST gains speed by producing alignments -or only the few 
database sequences likely to be related to the query, and therefore 
does not have the option of estimating and Kg on the fly. 
Instead, it uses estimates of these parameters produced before- 
hand by random simulation (3). A drawback of this approach is 
that the program may not accept an arbitrary scoring system, for 
which no simulation has been performed, and still produce 
accurate estimates of statistical significance. The original BLAST 
programs, in contrast, because they dealt only with ungapped 
local alignments, could derive and from theory for any 
scoring matrix (8,9). 

ITERATED APPLICATION OF BLAST TO 
POSITION-SPECIFIC SCORE MATRICES 

Database searches using position-specific score matrices, also 
called profiles or motifs, often are much better able to detect weak 
relationships than are database searches that use a simple 
sequence as query (28-38). Employing these methods, however, 
frequently has involved the use of several different programs and 
a fair degree of expertise. Accordingly, to render the power of 
motif searches more readily available, we have written a 
procedure to construct a position-specific score matrix automati- 
cally from the output of a BLAST run, and modified BLAST to 
operate using such a matrix in the place of a simple query. The 
resulting PSI-BLAST program often is substantially more 
sensitive than the corresponding BLAST program, but for each 



iteration takes little more than the same time to run. In related 
work, Henikoff and Henikoff (39) have described how, short of 
modifying BLAST so that it may operate on a position-specific 
score matrix, a single artificial sequence that approximates such 
a matrix may be used as a query with the original BLAST 
programs. 

The constmction of a position-specific score matrix is a 
multi-stage process, and at each stage a choice must be made 
among a number of alternative routes. We have been guided by 
the goals of automatic operation, sj)eed of execution, and general 
simplicity. The issues discussed below are: (i) general architec- 
ture of the score matrix; (ii) construction of the multiple 
alignment from which the matrix is derived; (iii) weights for 
sequences within the multiple alignment, and evaluation of the 
effective number of independent observations it constitutes; 
(iv) estimation of target frequencies, and the construction of 
matrix scores; (v) applying BLAST to a position-specific matrix, 
and the statistical evaluation of search results. We do not claim 
our current implementation is optimal, and it is likely that over 
time some of its details will change. 

Score matrix architecture 

The alignment of a simple sequence with a pattern embodied by 
a position-specific score matrix is almost completely analogous 
to the alignment of two simple sequences. The only real 
difference is that the score for aligning a letter with a pattern 
position is given by the matrix itself, rather than with reference to 
a substitution matrix. For proteins, a query of length L and a 
substitution matrix of dimension 20 x 20 are replaced by a 
position-specific matrix of dimension L x 20. Position-specific 
gap costs may be defined as well (34,40). As with pairwise 
sequence comparison, one may choose among finding the best 
global alignment of the matrix and the simple sequence (23), 
finding the best alignment of the complete matrix with a segment 
of the sequence (41), and finding the best local alignment of the 
matrix and sequence (24). 

Position-specific protein score matrices draw their power from 
two sources. The first is improved estimation of the probabilities 
with which amino acids occur at various pattern positions, leading 
to a more sensitive scoring system. The second is relatively 
precise definition of the boundaries of important motifs. By 
demanding the complete alignment of one or more motifs, rather 
than seeking an arbitrary local alignment, the size of the search 
space may be greatly reduced, thereby lowering the level of 
random noise. Unfortunately, there are many obstacles to 
automating well the delineation of a set of motifs from the output 
of a database search. The query sequence may contain a variety 
of different domains, and share different subsets of them with 
different proteins in the database. Furthermore, defining the 
proper extent of even a single motif may be challenging (42). 

Accordingly, we have chosen to forgo the potential advantages 
of resuicting the length of our derived matrices, and then 
demanding that they be completely aligned with segments of 
database sequences (41). Instead, each matrix we construct has 
length precisely equal to that of the original query sequence. 
When searching the database with such a matrix, we seek local 
alignments, in full analogy to those sought by BLAST when used 
for straightforward sequence-sequence comparison. Finally, we 
do not attempt to derive position-specific gap scores for use with 
our position-specific substitution scores. Instead, in each iteration 
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Histidine triad protein 68 WEKHPHGTSLTFSMQDGPEAGQTVKHVHVHVL- -PRKAGDF 107 

GP +G ++ H H+ +L P K F 
Phosphorylase 130 AUJNEESDKRHMVPYNSGPASGSSLDHKHLQILOKPEKFVTF 171 



Figure 5. (a) The multiple alignment generated by PSI-BLAST when the human fragile histidine triad (HIT) protein (61) (SWISS-PROT accession no. P49789) is 
compared to SWISS-PROT. All pairwise local alignments have £-value <0.0I, and are identified in SWISS-PROT as belonging to the HIT family. Thick bars within 
the six database sequences represent segments that align with various segments from the query. In constructing sequence weights for the indicated multiple alignment 
column, corresponding to residue 108 of the query, only the shaded portions of the multiple alignment are used, (b) A local alignment of the human HIT protein and 
H.injluenzae galactose- 1 -phosphate uridylyltransferase (63) (SWISS-PROT accession no, P31764). In its first position-specific iteration, PSI-BLAST gives this 
alignment a score of 45.4 biu, corresponding to an £-value of 4 x IQ-^. symbols reflect positive BLOSUM-62 matrix scores, even though a position-specific matrix 
is used to construct the alignment, (c) A local alignment of the human HIT protein and yeast 5',5"'-Pl ,P4-teiraphosphate phosphorylase I (64) (SWISS-PROT accession 
no. PI 6550). In its second position-specific iteration, PSI-BLAST gives this alignment a score of 43.4 bits, conresponding to an £- value of 2 x I(H. 



of PSI-BLAST, we employ the same gap scores that are used in 
the first, simple BLAST mn. Our reasons are that there is no good 
theory for deriving gap costs from a multiple alignment and that, 
as will be discussed below, by eschewing position-specific gap 
costs we can make a reasonable estimate of the statistical 
significance of the resulting local alignments. 

Multiple alignment construction 

To produce a multiple alignment from the BLAST output, we 
simply collect all database sequence segments that have been 
aligned to the query with value below a threshold, by default 
set to 0.01. The query is used as a master, or template, for 
constmcting a multiple alignment M. Any row (i.e., database 
sequence segment) identical to the query segment with which it 
aligns is purged, and only one copy is retained of any rows that 
are >98% identical to one another. Pairwise alignment columns 
that involve gap characters inserted into the query are simply 
ignored, so that M has exactly the same length as the query. 
Because we are dealing with local alignments, the columns of M 
may involve varying numbers of sequences, and many columns 
may include nothing but the query. We make no attempt to 
improve M by comparing database sequences with one another, 
or by any other true multiple alignment procedure. 

As will be discussed, the matrix scores constructed for a given 
alignment column should depend not only upon the residues 
appearing there, but upon those in other columns as well. To make 
this dependency easy to formulate, however, we need to prune oiu" 
raw multiple alignment M to a simpler *reduced' one. This 
pruning is done independently for each column, so the reduced 
multiple alignment Mq will in general vary from one column C 
to the next. To construct Afc, we first specify the set R of 
sequences it includes to be exactly those that contribute a residue 
to column C. We then define the columns of Mq to be just those 
columns of M in which all the sequences of R are represented. By 
constmction, the reduced multiple alignment Mq has residues or 



gap characters in every row and column (Fig. 5a), and is therefore 
amenable to the various manipulations described below. 

Sequence weights 

When constructing a score matrix from a multiple alignment, it 
is a mistake to give all sequences of the alignment equal weight. 
A large set of closely related sequences carries little more 
information than a single member, but its size alone may allow it 
easily to ^outvote' a snnall number of more divergent sequences. 
One way past this difficulty is to assign weights to the various 
sequences, with those having many close relatives receiving 
smaller weight. The many sequence weighting methods that have 
been proposed (43-5 1) often produce roughly equivalent results. 
Because of its speed and simplicity, we have implemented a 
modified version of the sequence weighting method of Henikoff 
and Henikoff (47). Gap characters are treated as a 21st distinct 
character, and any columns consisting of identical residues are 
ignored in calculating weights. In speaking of a colunnn's 
observed residue frequencies /j, we shall henceforth mean its 
weighted rather its raw frequencies. 

In constructing matrix scores, not only a column's observed 
residue frequencies are important, but also the effective number 
of independent observations it constitutes: a column consisting of 
a single valine and a single isoleucine carries different infomna- 
tion than one consisting of five independently occuning instances 
of each. Accordingly, we need to estimate the relative number 
of independent observations constituted by the alignment Mq- A 
simple count of the number of sequences in Mq is a poor measure, 
for 10 identical sequences imply fewer independent observations 
than do 10 divergent ones. We thus propose as a simple first 
estimate for Nc the mean number of different residue types, 
including gap characters, observed in the various columns of Mc- 
This estimate is clearly not ideal, as it saturates at 21 no matter 
how many independent sequences are contained in Mc- However, 
for the data we are likely to encounter, is typically much 
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smaller than 2 1 , and therefore perhaps a good enough approxima- 
tion for our purposes. As will be seen, it is not the absolute value 
of A^c that is important, but rather its relative value from one 
column to another. Nq is essentially the same measure of 
alignment variability as that proposed by Henikoff and Henikoff 
(52) for use in a different manner 

Target frequency estimation 

Given a multiple alignment, many methods for generating score 
matrices have been advanced (28-37,42,52-54). The prescrip- 
tion with perhaps the best theoretical foundation is that the scores 
for a specific pattern position be of the form log {Q\IP\), where Q\ 
is the estimated probability for residue i to be found in that column 
(29,30,32,33,36.37,42,52-54). This leaves open the question 
how best to estimate the Q{. 

Given a multiple alignment involving a large number of 
independent sequences, the estimate of Q\ for a specific colunui 
should converge simply to the observed frequency of residue / in 
that column. However, in addition to the sequence weighting 
issues discussed above, factors that complicate estimating the Q\ 
include small sample size (30), and prior knowledge of relation- 
ships among the residues (16,37,53). Various studies suggest that 
the best currently available method for estimating the Q\ is that of 
Dirichlet mixtures (52-56). However, because it often performs 
nearly as well (52), and due to its relative simplicity, we have 
implemented the data-dependent pseudocount method intro- 
duced by Tatusov et al. (37). This method uses the prior 
knowledge of amino acid relationships embodied in the substitu- 
tion matrix jjj to generate residue pseudocount frequencies g\, 
which are averaged with the observed frequencies f\ to estimate 
the Qi. 

Specifically, for a given column C, we constnict pseudocount 
frequencies using the formula; 

where the are the target frequencies implicit in the substitution 
matrix, and given by equation 3. Intuitively, those residues 
favored by the substitution matrix to align with the residues 
actually observed receive high pseudocount frequencies. We then 
estimate Q\ by: 



where a and P are the relative weights given to observed and 
pseudocount residue frequencies. So that the scores we constmct 
will reduce to s\y in columns where nothing has been aligned to the 
query sequence, we let a = A^c - 1- P remains an arbitrary 
pseudocount parameter; the larger its value, the greater the 
emphasis given to prior knowledge of residue relationships vis a 
vis observed residue frequencies. We have found empirically that, 
in conjunction with our method for calculating a, a reasonably 
good setting for P is 10. 

BLAST applied to position-specific score matrices 

The initial step of the BLAST algorithm is the construction of a 
list of words that align to query words with score at least T Only 
minor modifications to the code are necessary for this step to be 



performed on a query consisting of a position-specific matrix 
rather than a simple sequence. The same holds for the ungapped 
and gapped extension steps of BLAST. One important issue is 
whether key parameters such as Tand Xg, used at various heuristic 
stages of the algorithm and tuned to simple sequence comparison, 
can be applied unchanged to position-specific matrices without 
degrading unduly either the speed or sensitivity of database 
searches. We approach this problem by ensuring that the scale 7^ 
of the matrix scores produced internally by PSI-BLAST corre- 
sponds to that of the substitution matrix s\y In other words, we 
calculate the scores for a column of the matrix as [ln(Qi/Pt)]/^- 

There is no analytic theory with which to estimate the statistical 
significance of a gapped alignment of a position-specific score 
matrix and a simple sequence. However, one may hypothesize 
that for a score matrix constructed to the same scale as ^y, a given 
set of gap costs should produce the same gapped alignment scale 
parameter Xg as for s\y This would be convenient, because then 
PSI-BLAST could estimate statistical significance without 
expending after each iteration the substantial time required to 
estimate and Kg by random simulation. To test this hypothesis, 
we performed a number of statistical tests on PSI-BLAST 
generated score matrices, scaled to have - 0.3176, the value 
applicable to previously published BLOSUM-62 simulations (3). 

First, we searched SWISS-PROT using as query the length-567 
influenza A virus hemagglutinin precursor (27), and captured the 
score matrix constructed by PSI-BLAST from the 128 local 
alignments with £- value < 0.01 . We then compared this matrix to 
10 000 random sequences of length 567, generated using the 
background amino acid frequencies of Robinson and Robinson 
(20). A gap of length k was charged a cost of 10 -f Counts of 
the optimal local alignment scores, calculated using an appropri- 
ately modified version of the Smith- Waterman algorithm (24), 
are plotted in Figure 6. Also shown is the best fitting extreme 
value distribution (3,15) which, using the edge-effect correction 
described by Altschul and Gish (3), has statistical parameters Xg 
= 0.25 1 and A:g = 0.03 1 . It is apparent that the distribution fits the 
random trial reasonably well; a goodness-of-fit test with 34 
degrees of freedom has value 41.8, which is lower than one would 
expect 20% of the time even were the theory precisely valid. This 
supports the idea that the statistical theory described above 
applies to local alignments of position-specific score matrices and 
simple sequences. Furthermore, the estimate A.g = 0.251 ± 0.003 
agrees to within experimental error with the value 0.255 
previously published for these gap costs (3). Similar agreement 
was obtained with a number of other protein sequences as initial 
query (results not shown), and in all cases the much less important 
Kg parameter could be estimated accurately as well. In general, 
values of Xg for the comparison of position-specific score 
matrices with simple sequences appear to differ by <2% from the 
values for simple painvise sequence comparison. Using these 
precomputed values for Xg should thus entail an error of less than 
one bit for PSI-BLAST scores <50 bits, corresponding to a factor 
of less than two in the estimation of statistical significance. 

PERFORMANCE EVALUATION 

To test more directly the statistics used by PSI-BLAST, we 
compared query sequences from 1 1 large and well characterized 
protein families to the SWISS-PROT database, and then ran the 
position-specific score matrices generated against a shuffled 
version of the same database. For each query, we recorded the 
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Figure 6. Tlie distribution of optimal local alignment scores from the 
comparison of a position-specific score matrix with 10 000 random protein 
sequences. The score matrix was constmcted by PSI-BLAST from the 1 28 local 
alignments with £- value <0.01 found in a search of SWISS-PROT using as 
query the length-567 influenza A virus hemagglutinin precursor (27) (SWISS- 
PROT accession no. P03435). Tlie random sequences, each of length 567, were 
generated using the amino acid frequencies of Robinson and Robinson (20). 
Optimal local alignment scores were calculated using the position-specific 
matrix in conjunction with 10 + k gap costs. The extreme value distribution that 
best fits the data (3,15) is plotted. A goodness-of-fit test with 34 degrees of 
freedom has value 41.8, corresponding to a P-value of 0.20. 

lowest E-value found, as well as the number of shuffled sequences 
yielding E-values <1 and 10. For conq)arison, we peiformed the 



identical shuffled-database test on the gapped and original 
versions of BLAST. To reduce the probability that high-scoring 
alignments were missed due to the heuristic nature of the 
algorithms, we performed these tests with 7=9 rather than the 
default value of 11. The results are given in Table 2. For the 1 1 
queries, the median of the low PSI-BLAST E-values was 0,87, 
which corresponds to a median P- value of 0.58 (8,9). The mean 
numbers of shuffled database sequences with £-values <1 and 10 
were 1.0 and 8.7, respectively, within 20% of the expected values 
of 1 .0 and 10.0. The equivalent tests for the ungapped and gapped 
versions of BLAST also yielded results that diverged from theory 
by <50%. 

The ability to estimate with reasonable accuracy the signifi- 
cance of gapped local matrix-sequence alignments permits us to 
automate the construction of position-specific score matrices 
during multiple iterations of the PSI-BLAST program. After each 
iteration, we generate a new multiple alignment simply by 
collecting those alignments with £-value lower than a defined 
threshold. An interactive version of PSI-BLAST allows the user 
to override either the inclusion or exclusion of specific local 
alignments. Once a given database sequence has been used in the 
generation of a position-specific score matrix, low ^-values for 
this sequence are virtually guaranteed in future iterations, for the 
sequence is to a certain extent being compared with itself The 
biological relevance of PSI-BLAST output thus depends criti- 
cally on avoiding the inappropriate inclusion of sequences in the 
multiple alignment constmcted. Specifically, the utility of the 
score matrix produced is immediately vitiated by the inclusion of 
any alignment involving a region of highly biased amino acid 
composition (57,58). 



Table 2. The comparison of various query sequences with a shuffled version of SWISS-PROT 



Protein family 


SWISS-PROT 


Original BLAST 




Gapped BLAST 




PSI-BLAST 






accession no. 


Low 


No. of seqs 


Low 


No. of seqs 


Low 


No. of seqs 




of query 


£- value 


with £- value 


£- value 


with £- value 


value 


with value 








<1 


<10 




<1 


^10 




<1 


<10 


Serine protease 


P00762 


0.86 


1 


7 


3.0 


0 


4 


0.94 


1 


8 


Serine protease inhibitor 


P01008 


3.9 


0 


4 


0.078 


1 


9 


1.5 


0 


9 


Ras 


POllll 


3.4 


0 


8 


3.4 


0 


7 


1.1 


0 


9 


Globin 


P02232 


2.4 


0 


7 


2.8 


0 


5 


8.2 


0 


2 


Hemagglutinin 


P03435 


0.11 


2 


11 


0.46 


3 


16 


0.87 


1 


8 


Interferon a 


P05013 


2.4 


0 


6 


0.27 


2 


4 


O.ll 


2 


11 


Alcohol dehydrogenase 


P07327 


1.5 


0 


2 


0.80 


1 


5 


1.5 


0 


9 


Histocompatibility antigen 


P10318 


0.91 


1 


7 


0.13 


1 


7 


0.0031 


2 


6 


Cytochrome P450 


PI 0635 


0.84 


2 


5 


8.5 


0 


3 


0.46 


I 


15 


Glutathione transferase 


PI 4942 


1.0 


1 


10 


3.3 


0 


3 


0.30 


2 


9 


H'*"-transporting ATP synthase 


P20705 


0.012 


1 


8 


0.26 


2 


14 


0.79 


2 


10 


Average (median or mean) 




1.0 


0.7 


6.8 


0.80 


0.9 


7.0 


0.87 


1.0 


8.7 



The original and gapped BLAST comparisons use BLOSUM-62 substitution scores (18). All three programs use threshold 7 parameter set to 9, but the gapped 
BLAST and PSI-BLAST programs use the two-hit method to trigger ungapped extensions. The original BLAST program has the X dropoff parameter set to nominal 
score 23. The gapped BLAST and PSI-BLAST comparisons charge gaps of length k a cost of 10 + it. They have set to 16, and set to 40 for die database search 
stage and to 67 for the output stage of the algorithms. Gapped alignments are triggered by a score corresponding to -22 bits. For PSI-BLAST, the query is first com- 
pared to the SWISS-PROT database, and the position-specific score matrix generated is dien compared to a shuffled version of SWISS-PROT. The median is used 
for the average of the low £-values, and the mean otherwise. 
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Table 3. The number of SWISS-PROT sequences yielding alignments with f-value <0.01. and relative running times, for Smith-Waterman and various versions 
of BLAST 



Protein family 


Query 


S mi til- Waterman 


Original BLAST 


Gapped BLAST 


PSI-BLAST 


Serine protease 


P00762 


275 


273 


275 


286 


Serine protease inhibitor 


P01008 


108 


105 


108 


111 


Ras 


POllU 


255 


249 


252 


375 


Globin 


P02232 


28 


26 


28 


623 


Hemagglutinin 


P03435 


128 


114 


128 


130 


Interferon a 


P05013 


53 


53 


53 


53 


Alcohol dehydrogenase 


P07327 


138 


128 


137 


160 


Histocompatibility antigen 


P10318 


262 


241 


261 


338 


Cytochrome P450 


P10635 


211 


197 


211 


224 


Glutathione transferase 


PI 4942 


83 


79 


81 


142 


H+- transporting ATP synthase 


P20705 


198 


191 


197 


207 


Normalized running time 




36 


LO 


0.34 


0.87 



To score and evaluate the significance of the alignments found, the original BLAST program uses BLOSUM-62 substitution scores (1 8) and sum-statistics (21,22). 
The Smith-Waterman and gapped BLAST programs use BLOSUM-62 substitution scores, 10 + Jfc gap costs, and the statistics of equations 1 and 2, in conjunction 
with the experimentally determined parameters Xg = 0.255 and Kg = 0.035 (3). PSI-BLAST uses the same gap costs and A^, but applied to the position-specific score 
mauix constructed from the output of the gapped BLAST run. Only one PSI-BLAST iteration is executed. All three BLAST programs use the same parameter settings 
as in Table 2, except that T is set to 1 1 . Normalized running times are the mean ratio of program mnning time to that for the original BLAST. The time for PSI-BLAST 
includes the time for the mitial BLAST search. 



To compare the performance of the new gapped version of 
BLAST and its PSI-BLAST extension to that of the Smith- 
Waterman algorithm (24) and the original ungapped BLAST 
algorithm, we employed the same 1 1 query sequences that were 
used above to investigate the accuracy of PSI-BLAST statistics. 
Because, as shown, these statistics are quite accurate, we may use 
the number of statistically significant sequences found in a 
database search as a reasonable measure of ^gorithm sensitivity. 
We employed the ssearch program, version 2.0u54, from the 
Fasta package (2) as our implementation of the Smith-Waterman 
algorithm. Using each of the 11 queries, we searched SWISS- 
PROT with each of the four programs. We show in Table 3 the 
numbers of sequences found with E value <0.01, as well as the 
average ratio of running time to that for the original BLAST 
program. Based upon SWISS-PROT annotation, all sequences 
recorded in Table 3 appear to be tme family members, with the 
exception of one of the lowest-scoring alignments found by 
Smith-Waterman when applied to the histocompatibility antigen 
query, and the lowest-scoring alignment found by the original 
BLAST applied to the hemagglutinin query. While some 
alignments involve hypothetical proteins, the pattern of con- 
served residues in all such cases suggests a true positive. 

As can be seen, the gapped BLAST program runs on average 
three times faster then the original, and in all but one case 
examined finds a greater number of statistically significant 
alignments. It runs > 100 times faster than Smith- Waterman, but 
for the combined 11 queries misses only eight of the 1739 
significant similarities found by the rigorous algorithm. Of these 
eight, only one has an £'-value <0.00l, and another appears to be 
a random as opposed to a biologically meaningful similarity. The 
scores produced by gapped BLAST for the 1731 similarities it 
finds differ from those produced by the Smith-Waterman 
algorithm in only two instances. The discrepancy arises in both 
cases from an Kg parameter that is too low rather than from an 
incorrect choice of seed. Thus despite its simplicity, the 
seed-selection heuristic is extremely accurate. 



A search that includes a single PSI-BLAST iteration still mns 
faster than the original BLAST, and 40 times faster than 
Smith-Waterman, but can in many cases be much more sensitive. 
It finds every tme positive returned by Smith-Waterman, but 
frequently many others as well. Here only a single PSI-BLAST 
iteration has been considered but, as will be seen below, multiple 
iterations can yield even better results. Furthermore, we have 
found PSI-BLAST to perform better on searches of the non- 
redundant protein sequence database maintained by the NCBI 
(59) than on searches of SWISS-PROT, because of the greater 
number of significant similarities that are found by the initial 
BLAST run. 

For the particular examples in Table 3, the PSI-BLAST 
iteration takes noticeably longer than the gapped BLAST 
iteration, due primarily to the time needed to construct the 
position-specific score matrix from the large number of signifi- 
cant local alignments found by BLAST. For queries that return a 
small number of significant alignments, each PSI-BLAST 
iteration requires more nearly the same time as BLAST. 

PSI-BLAST EXAMPLES 

In many instances, PSI-BLAST is able automatically to uncover 
biologically interesting similarities that elude simple database 
searches. Multiple iterations of PSI-BLAST are sometimes required 
to recognize the more distandy related protein family members. We 
here consider two representative cases in greater detail. 

HIT proteins 

Hohn and Sander (60) describe how a comparison of three-dimen- 
sional stmctures is able to identify significant similarity between 
histidine triad (HIT) proteins and galactose- 1 -phosphate uridylyl- 
transferase (GalT) proteins. Indeed, using the human HIT protein 
(61) as query, a BLAST search of SWISS-PRCJT reveals hits with 
f-value <0.0I only to odier HIT proteins (Fig. 5a). An alignment to 



Nucleic Acids Research, 1997, Vol. 25, No. 17 3399 



the rat GalT protein (62) has the only marginally significant £-value 
of 0.012. A PSI-BLAST search, using the score noatrix generated 
from the six alignments Qlustrated in Figure 5a, can immediately 
cement confidence in the biological relevance of this similarity. The 
E-value of the similarity with rat GalT drops to 2 x 10^, and an 
alignment to Haemophilus influenzae GalT (63) (Fig. 5b) receives 
the even more significant E-value of 4 x 10~^. These similarities, of 
course, are uncovered using no structural information. In addition, 
on the next iteration, PSI-BLAST fmds a strongly significant 
alignment (Fig, 5c; E-value 2 x 10^) to yeast 5',5'"-Pl J^4-tetra- 
phosphate phosphorylase I (64), for which no stmcture is available. 

BRCT proteins 

Proteins containing one or multiple copies of the BRCT domain 
forma superfamily many of whose members are involved in DNA 
damage-responsive cell cycle checkpoints (65-67). While detailed 
analysis is needed to delineate completely this diverse set of 
proteins, PSI-BLAST is able to automatically identify most of the 
superfamily. We used the C-terminal 215 residues of human 
BRCAl (68), which includes two BRCT domains (65), as the 
initial query for a search of NCBFs non-redundant protein 
sequence database. Using the default cutoff value of 0.01. the 
initial BLAST search recognized as significant only alignments to 
other BRCAl sequences, and the previously described BRCT 
protein BARD (69) (Table 4). Subsequent PSI-BLAST iterations, 
however, retrieved all the proteins recorded in Table 4; additional 
close homologs are omitted from the table. Almost all die BRCT 
proteins described by Boric et al. (66) were recognized. Not found 
were the retinoblastoma family, whose putative BRCT domain is 
particularly divergent, worm R13A5.13, which was not in the 
database searched, and human DNA-ligase III. PSI-BLAST did 
report yeast RAD9 and YGRlOSw, the Kluyveromyces lactis 
RAP I homolog, worm ZK675.2, and various terminal deoxynu- 
cleotidyltransferases and poly(ADP-ribose) polymerases, but all 
with E-values >0.01 (Table 4). Detailed examination of the 
alignments produced suggests that the only likely false positives 
involved a trypanosome EST (70) and the Methanococcus 
jannaschii mutT protein (71), the latter despite its involvement in 
DNA repair (Table 4). 

Seven recent additions to the protein databases are reported here 
as members of the BRCT superfamily (Table 4). (i) Arabidopsis 
T10M13.12 (72). is the fust plant protein observed to contain 
BRCT domains, (ii) KIAA0259 (73) is a large human protein of 
unknown function with eight BRCT domains, the greatest number 
so far observed within a single protein, (iii) T13F2.3 (74) is a worm 
protein with a 500-residue low-complexity (57) N-terminus. 

(iv) SPAC6G9. 12 (75) is a fission yeast protein strongly similar to 
the previously recognized (66) yeast BRCT protein L8543. 18 (76). 

(v) C36A4.8 (74) is a worm protein whose C-terminus contains a 
single BRCT domain, and whose N-terminus, containing a RING 
finger domain, is strongly similar to that of BRCAl. The similar 
organization to BRCAl makes this protein of particular interest. 

(vi) Synechocystis sp. D90904 (77) is the fust bacterial BRCT 
protein that is not a bacterial ligase. While it failed to pass the cutoff 
E- value of 0.0 1 , its C-terminal BRCT domain is very similar to that 
of several bacterial ligases, wliich presunnably led to its incorrect 
classification as such in the databases. Most of the protein 
N-terminal to its BRCT domain consists of a coiled-coil domain. 
The actual Synechocystis sp. DNA ligase (IIJS) is found by 
PSI-BLAST on the 13th iteration, with an E-value of 0.002. 



H. sapiens BRCAl O 

A.thaliana T10M13.12 € E 

H. sapiens KIAA0259 ^^^^^^H-iM— ^1— IM^ 

C.elegans T13F2.3 CP ^ ^fi^^ 

S.pombe SPAC6G9.12 

C.elegans C36A4.8 Q ^ 

Synechocystis sp. D90904 0^ 

H. sapiens PescadillO i22 

Figure 7. The location of BRCT domains within human BRCAl (68), 
Arabidopsis T10M13.12 (72), human KIAA0259 (73), womi T13F2.3 (74), 
fission yeast SPAC6G9.I2 (75), wonn C36A4.8 (74), Synediocystis sp. 
D90904 (77) and human PescadiUo (79). BRCAl and C36A4.8 each have, in 
addition, an N-temiinal RING fmger domain. The near identity to other worm 
sequences of a short region directly preceding the BRCT domain of C36A4.8 
suggests the possibility that this protein has been misassembled. 

(vii) PescadiUo is a human protein whose zebrafish ortholog is 
essential for embryonic development (79), and whose yeast 
ortholog YGR103W (80) has been previously recognized as a 
BRCT protein (66,67). It failed to pass the cutoff £-value of 0.01 , 
but appeared with near-significant ^-values in PSI-BLAST output 
from the 5th iteration onward. The approximate positions of the 
BRCT domains within BRCAl and the seven newly identified 
BRCT proteins are illustrated in Figure 7. 

DISCUSSION AND CONCLUSION 

In addition to the major algorithmic changes described above, we 
have modified an aspect of the original BLAST program*s output 
routine that on occasion caused important similarities to be 
overlooked. When a very large niunber of statistically significant 
alignments was found, BLAST would typically report only the top 
scoring 500. These alignments, however, might all involve one 
domain of the query that occurred frequently within the database. 
Interesting but weaker relationships to other regions of the query 
might simply be forced off the bottom of the list. Accordingly, 
following the general idea of Sonnhammer and Durbin (81), we 
have limited the number of alignments reported that involve each 
region of the query, but set no overall upper limit. 

The BLAST programs are unlikely to remain static, and there 
are many possible avenues for future improvement. We discuss 
three of them briefly here. 

Gap costs 

Gapped ahgnments may be constmcted using a variety of different 
types of gap cost. Because a single mutational event may insert or 
delete a laige number of residues, it has been argued that long gaps 
should not cost much more than short ones, and affine gap costs, 
which assess a score -(a + bk) for a gap of length k (82-85), have 
become the most widely used. A generalization of these costs has 
been proposed, that allows a gap to involve residues in both 
sequences rather than just one (86). Specifically, a gap in which k 
residues are inserted or deleted and ; pairs of residues are left 
unaligned receives die score -(a + bk-\- cj). The algorithm necessary 
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for using such costs is only a minor variant on that for traditional 
affine gap costs. In many cases, the new gap costs generate local 
alignments that are both more accurate and more statistically 
significant (86). These costs are potentially of particular value for 
use with PSI-BLAST, because by imposing alignment only where 



it is justified, they may lead to the construction of more sensitive 
position-specific score matrices. Whether it is desirable to use 
generalized affine gap costs as the default for general purpose 
database searches awaits detailed empirical study. 



Table 4. PSI-BLAST protein database search results using the C-terminus of BRCAl as query 



Protein 


Species 


GenBank ID number 


PSI-BLAST iteration 


£- value 
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1710175 
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2e-06 


TlflMI T 1 
I I UlVl 1 J . 1 ^ 


Arab idopsis tfuiliana 
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1 
1 


*fe-uo 




Caenorhabditis elegans 


ly 141 /u 


1 
1 


A a r\A 




H . sapiens 


1 DO J /a J 


1 
1 


u.uu 1 


F37D6. 1 


C' elegans 






*te-uo 




Schizosaccharomyces pottibe 


17-1'lCrtl 


-> 


oc-uj 


KIAA017n 




1 1 XfJUi^ 


2 


0.002 


53BP1 


H . sapiens 
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U.UUo 


T13F2.3° 


Ct elegans 


100 / JJ*f 


\ 

J 


2e-07 


K04C2.4 




47035 1 


3 


3e-07 


T19E10.I 


K^'Cicganj 


lUO /UUJ 


4 






o.pofiioe 




4 




REVl 


Sacciiaromyces cerevisiae 


1 '^'?AAO 
1 ji*Hjy 


A 


U.UUJ 


ECT2 


Mus musculus 




c 


1 /» 

ie-i>r 


XRCCl 


A/, m usculus 




J 


rid 


Crb2 


S.pombe 


I449I77 


5 


0.002 


RAP I 


S. cerevisiae 


173558 


5 


0.006 


TcESTOSO*-' 


Trypanosoma cruzi 


1536857 


6 


0.001 


DPBIl 


Sxerevisiae 


1352999 


6 


0.001 


L8543.18 


S.cerevisiae 


1078075 


6 


0,010 


SPAC6G9.12a 


S.pombe 


1644324 


7 


4e-04 


YM8021.03 


S.cerevisiae 


1078533 


7 


0.005 


YHR154W 


S.cerevisiae 


731729 


7 


0.008 


C36A4.8^ 


C. elegans 


1657667 


7 


0.010 


UNE452 


S.cerevisiae 


1151000 


8 


8e-04 


DNA ligase IV 


H. sapiens 


1706482 


8 


0.008 


CDC9 


Candida albicans 


1706483 


9 


0.006 


DNA ligase 


Thermus scotoductus 


1352293 


10 


0.010 


GNFI 


Drosophila melanogaster 


544404 


11 


0.004 


mutT^ 


M.jannaschii 


2129134 


15 


0.008 


RAD9 


S.cerevisiae 


131817 


7 


0.74 


RAPl homolog 


K.laciis 


422087 


9 


0.21 


ZK675.2 


C.elegans 


599712 


13 


3.5 


D90904a 


Synecliocystis sp. 


1652299 


15 


0.17 


TDT 


Mus domes tica 


2149634 


15 


0.46 


YGR103W 


S.cerevisiae 


1723693 


16 


0.017 


Pescadillo^ 


H. sapiens 


2194203 


16 


0.017 


PPOL 


Sarcopliaga peregrina 


1709741 


16 


0.060 



Iteration zero refers to the initial BLAST nin, using the 215 C-terminal residues of BRCAl (68) (SWISS-PROT accession no. 
P38398) as query. Subsequent PSI-BLAST iterations use derived position-specific score matrices in place of the query. The score 
matrix for iteration / + 1 is constructed from alignments achieving an £- value <0.01 for iteration 1. For each protein, the £- value is 
that returned during the PSI-BLAST iteration indicated, and precedes the protein's use for score matrix construction. Only one repre- 
sentative is listed for families of closely related proteins. On its 16th iteration PSI-BLAST uncovered no new proteins with £- value 
<0.0 1 , and therefore ceased iteration. At the end of the table are shown BRCT proteins returned by PSI-BLAST with £"- value >0.0 1 
but <10, listed for the iteration in which they achieved their lowest £- value. 
"Recent additions to the database, first identified as BRCT proteins here. 

*The C.elegansF16D2.h protein (74) while a recent addition to the databases, is a close homolog of the previously recognized (66,67) 
family of C.elegans BRCT proteins containing, for example, F37A4-4 (90). 

*The trypanosome EST (70) and the M.jannaschii mutT protein (7 1 ) are the only likely false positives. 
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Position-specific score matrices as input to PSI-BLAST 

PSI-BLAST performs three distinct operations: it constructs a 
multiple alignment from BLAST output data; it processes diis 
alignment into a position-specific score matrix; and it uses diis 
matrix to searcii the database. A researcher may wish, however, to 
bypass the first two of these operations, and provide a score matrix 
as query directiy to PSI-BLAST. The central difficulty is retaining 
the ability to calculate reliable statistics; as described above, 
PSI-BLAST imposes strict scaling rules on the matrices it generates, 
permitting the use of precomputed Xg to assess significance. Three 
possible routes are open, (i) One may permit the specification of 20 
target frequencies Qi for each position rather than 20 scores. These 
can then be converted internally to log-odds scores widi the 
appropriate scale for precomputed statistical parameters to apply. 

(ii) One may estimate by random simulation the statistical para- 
meters for an input score matrix (3). An advantage is the 
applicability to a greater range of scoring systems, including the 
possible use of position-specific gap costs. A disadvantage is that 
obtaining reasonably accurate estimates of the relevant statistical 
parameters nnay increase die program's execution time unduly. 

(iii) One may alDandon the statistical assessment of die alignments 
produced This gives die greatest scope to input scoring systems, but 
precludes any reasonable scheme for automatic iteration of PSI- 
BLAST. Much experimentation will be required to determine which 
of tiiese approaches is the most fruitful. 

Realignment 

After the initial BLAST mn, or a later PSI-BLAST iteration, the 
multiple alignment used for subsequent iterations can be con- 
structed in a more sophisticated manner than described above. 
Rather than coalescing all the pairwise alignments that pass the 
threshold immediately into a multiple alignment, the most 
significant among them can be used to build an initial multiple 
alignment and associated position-specific score matrix, which 
can then be used to rescore and realign database sequences that 
received lower scores. This step can be iterated several times 
before another full-scale database search is executed. There are 
several potential advantages to this procedure, (i) Weaker 
painvise alignments, that may be somewhat inaccurate, can be 
improved and perhaps extended before they are incorporated into 
the evolving multiple alignment, (ii) Unrelated sequences that 
received chance high scores can have their scores downgraded by 
an improved matrix, and perhaps be rejected before they are 
included in the alignment, (iii) Related sequences that received 
relatively high alignment scores, but missed the threshold for 
inclusion, can have their scores increased, and perhaps be 
included in the multiple alignment. In short, the realignment 
procedure can prevent inaccurate pairwise alignments from 
corrupting the evolving multiple alignment, and can accelerate 
the recognition of related sequences, all for very little computa- 
tional cost. Preliminary studies suggest this line of development 
to be a fmitful one. 

In conclusion, the new gapped version of BLAST is both 
considerably faster than the original one, and able to produce 
gapped alignments. While the relevant statistical parameters can 
no longer be calculated from theory, random simulation allows 
them to be estimated beforehand for commonly used amino acid 
substitution matrices and gap costs. For many queries, the 
PSI-BLAST extension can greatly increase sensitivity to weak 
but biologically relevant sequence relationships. PSI-BLAST 



retains the ability to report accurate statistics, per iteration runs in 
times not much greater than gapped BLAST, and can be used both 
iteratively and fully automatically. These developments should 
enhance significantly the utility of database search methods to the 
molecular biologist. 

Note 

Source code for the new BLAST programs is available by 
anonymous ftp from the machine ncbi.nlm.nih.gov, within the 
directory *blast*, and the programs may be run from NCBFs web 
site at http://www.ncbi.nlm.nih.gov/ 
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